Getting Started with EDA on Time Series Data

In this notebook we present some tips and tricks for starting an exploratory data analysis (EDA) of time series data. The success of any data analysis project depends heavily on the EDA phase. We do not intend to give a complete guide on how to run an EDA (especially because it depends on the use case and the data). Instead, we focus on two important components: (i) data visualization and (ii) treating missing values.

Data Source: We are going to consider the data used for the Prophet meetup.

Prepare Notebook

Read Data

Data Preparation

Let us look into missing values.

Missing values occur only in the period we want to predict (> July 2021). Since we are not doing any forecasting here, we remove this period and generate some date features.
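As a minimal sketch of this preparation step, the snippet below counts missing values, drops the forecast period and derives calendar features. The data is synthetic, and the column names `date` and `sales`, the cutoff date, and the chosen features are assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (column names and dates are assumed).
rng = np.random.default_rng(seed=42)
dates = pd.date_range("2021-06-01", "2021-07-31", freq="D")
raw_df = pd.DataFrame({"date": dates, "sales": rng.normal(100.0, 10.0, len(dates))})
raw_df.loc[raw_df["date"] >= "2021-07-01", "sales"] = np.nan  # period to predict

# Count missing values per column.
na_counts = raw_df.isna().sum()

# Keep the observed period only, then derive calendar features from the date.
data_df = (
    raw_df.loc[raw_df["date"] < "2021-07-01"]
    .copy()
    .assign(
        year=lambda x: x["date"].dt.year,
        month=lambda x: x["date"].dt.month,
        dayofweek=lambda x: x["date"].dt.dayofweek,
        dayofyear=lambda x: x["date"].dt.dayofyear,
    )
)
```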

We will focus on the sales variable in this notebook.

Data Visualization

We start by plotting the data. We will mainly use seaborn for data visualization.

Here are some initial observations on the data:

Aggregations

If you want to see a more "global" behaviour you can either (i) aggregate or (ii) smooth. Let us see how the data looks when we sum the sales over the days of the week.
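Both options can be sketched with pandas on a synthetic daily series. Here "summing over the days of the week" is read as a weekly total, which is one possible interpretation. The 7-day window for smoothing is likewise an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for four full weeks (2020-01-06 is a Monday).
rng = np.random.default_rng(seed=0)
dates = pd.date_range("2020-01-06", periods=28, freq="D")
sales = pd.Series(rng.normal(100.0, 10.0, len(dates)), index=dates, name="sales")

# (i) Aggregate: total sales per calendar week (weeks end on Sunday).
weekly_sales = sales.resample("W").sum()

# (ii) Smooth: centered 7-day rolling mean, which keeps the daily resolution.
smooth_sales = sales.rolling(window=7, center=True).mean()
```

Aggregation reduces the number of points, while smoothing keeps one value per day at the cost of undefined values at the edges of the series.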

Distributions

Besides the time dependency, it is also useful to look at the distribution of the values.

The two peaks in the distribution reflect the two levels we see in the data (before and after 2020-01-01). We can see this better if we color by year.
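A numerical counterpart to coloring by year is a per-year summary. This is a sketch on synthetic data with a level shift at 2020-01-01. The levels 100 and 150 and the noise scale are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic series with two levels: one before and one after 2020-01-01.
rng = np.random.default_rng(seed=7)
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
level = np.where(dates.year < 2020, 100.0, 150.0)
df = pd.DataFrame({"date": dates, "sales": level + rng.normal(0.0, 5.0, len(dates))})
df["year"] = df["date"].dt.year

# One row per year: each peak of the pooled distribution maps to one year.
yearly_summary = df.groupby("year")["sales"].agg(["mean", "median", "std"])
```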

Seasonality

Now let's generate plots to compare the yearly seasonality.

Next we want to compare the sales development year-over-year for every month.
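One way to arrange the data for such a comparison is a month-by-year table. The sketch below uses a deterministic toy series, and the column names `date`, `year` and `month` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: constant daily sales of 100 in 2019 and 120 in 2020.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
df = pd.DataFrame({"date": dates, "sales": np.where(dates.year == 2019, 100.0, 120.0)})

# Rows are months, columns are years, cells are total monthly sales.
monthly_pivot = df.assign(
    year=df["date"].dt.year, month=df["date"].dt.month
).pivot_table(index="month", columns="year", values="sales", aggfunc="sum")
```

Each row of `monthly_pivot` can then be drawn as a group of bars or a line, one per year.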

Most of the time we care not only about the absolute values but also about the relative growth:
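Relative year-over-year growth per month can be computed as a percentage change within each month. This is a sketch on toy data, and the column name `yoy_growth` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: constant daily sales of 100 in 2019 and 120 in 2020.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
df = pd.DataFrame({"date": dates, "sales": np.where(dates.year == 2019, 100.0, 120.0)})

# Total sales per (year, month), sorted by year and then month.
monthly = (
    df.assign(year=df["date"].dt.year, month=df["date"].dt.month)
    .groupby(["year", "month"], as_index=False)["sales"]
    .sum()
)

# Within each month the rows are ordered by year, so pct_change compares
# a month with the same month of the previous year.
monthly["yoy_growth"] = monthly.groupby("month")["sales"].pct_change()
```

The first year has no predecessor, so its growth values are missing by construction.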

Finally, we can also plot the monthly sales share over each year.
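The monthly share is each month's total divided by the total of its year. A minimal sketch on toy data follows, and the column name `share` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: constant daily sales of 100 in 2019 and 120 in 2020.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
df = pd.DataFrame({"date": dates, "sales": np.where(dates.year == 2019, 100.0, 120.0)})

monthly = (
    df.assign(year=df["date"].dt.year, month=df["date"].dt.month)
    .groupby(["year", "month"], as_index=False)["sales"]
    .sum()
)

# Divide each monthly total by the total of its own year.
yearly_total = monthly.groupby("year")["sales"].transform("sum")
monthly["share"] = monthly["sales"] / yearly_total
```

Because the shares of one year sum to one, they make years with different overall levels directly comparable.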

Many Variables

As mentioned in the introduction, we are going to focus on the sales data. Nevertheless, we provide code to plot two variables with different scales in the same figure. For comparing variables, a correlation analysis might be a good next step.
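A common way to show two variables with different scales is a secondary y-axis. The sketch below uses plain matplotlib rather than seaborn, and the second variable `media` is a hypothetical stand-in generated at random.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: "media" is an assumed second variable on a larger scale.
rng = np.random.default_rng(seed=1)
dates = pd.date_range("2020-01-01", periods=90, freq="D")
df = pd.DataFrame(
    {
        "date": dates,
        "sales": rng.normal(100.0, 10.0, len(dates)),
        "media": rng.normal(5000.0, 800.0, len(dates)),
    }
)

fig, ax1 = plt.subplots(figsize=(10, 4))
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

ax1.plot(df["date"], df["sales"], color="C0")
ax2.plot(df["date"], df["media"], color="C1")
ax1.set_ylabel("sales", color="C0")
ax2.set_ylabel("media", color="C1")

# Pearson correlation as a first numerical comparison of the two variables.
corr = df["sales"].corr(df["media"])
```

Coloring each axis label like its line helps the reader match curves to scales.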

Missing Values

Now we explore some methods to fill missing values.

Remarks:

Generate Missing Values

We begin by creating a new variable sales_na which is obtained from sales by randomly removing values.
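A minimal sketch of this step on synthetic data follows. The 20% removal rate, the seed and the indicator column name `is_na` are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic complete series standing in for the real sales data.
rng = np.random.default_rng(seed=42)
dates = pd.date_range("2020-01-01", periods=200, freq="D")
df = pd.DataFrame({"date": dates, "sales": rng.normal(100.0, 10.0, len(dates))})

# Draw a random boolean mask and blank out the selected values.
na_mask = rng.random(len(df)) < 0.2
df["sales_na"] = df["sales"].mask(na_mask)

# Indicator function: 1 where the value is missing, 0 otherwise.
df["is_na"] = df["sales_na"].isna().astype(int)
```

Keeping the original `sales` column makes it possible to score each imputation method against the true values later.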

We plot the variable sales_na and a missing values indicator function.

Fill Missing Values

Now we apply several methods to fill missing values using pandas and scikit-learn.
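To make the idea concrete, here is a sketch of a few common filling strategies. It uses pandas only and a tiny hand-made series. Which methods the notebook actually compares is not stated here, so this selection is an assumption.

```python
import numpy as np
import pandas as pd

# True values and a copy with three of them removed.
index = pd.date_range("2020-01-01", periods=6, freq="D")
truth = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index)
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0], index=index)

filled = pd.DataFrame(
    {
        "ffill": s.ffill(),  # carry the last observed value forward
        "bfill": s.bfill(),  # carry the next observed value backward
        "linear": s.interpolate(method="linear"),  # straight line between neighbors
        "mean": s.fillna(s.mean()),  # constant fill with the series mean
        "median": s.fillna(s.median()),  # constant fill with the series median
    }
)

# Mean absolute error of each method on the positions that were missing.
mae = filled.loc[s.isna()].sub(truth.loc[s.isna()], axis=0).abs().mean()
```

On this perfectly linear toy series, linear interpolation recovers the truth exactly. Real data will rank the methods differently.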

Let us plot the imputation against the true values.

Results:

Warning: This was an over-simplified way of filling missing values. In practice you need to be very careful when evaluating forecasting models via time-slice cross-validation, as you do not want to leak information. For example, you cannot always use the backward fill, because you do not know the value of the next point if the missing value is the last one. Also, even if filling with the mean/median works fine (e.g. for a stationary series), you need to make sure you fill missing values with the mean/median of the training fold and not of the whole data set. It is good practice to add your imputation method as a step in your pipeline.
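The difference between a fold-aware and a leaky mean fill can be sketched on a single train/test split. The values and the split point are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 12.0, 14.0, np.nan, 11.0, np.nan, 15.0])

# Time-ordered split: the first five points are the training fold.
split = 5
train, test = s.iloc[:split], s.iloc[split:]

# Correct: the fill value is estimated on the training fold only ...
train_mean = train.mean()
train_filled = train.fillna(train_mean)
# ... and then reused, unchanged, on the test fold.
test_filled = test.fillna(train_mean)

# Leaky: this mean also uses observations from the test fold.
leaky_mean = s.mean()
```

Wrapping the imputation in a fitted pipeline step gives the same behavior automatically, since the statistic is learned when fitting on the training fold and only applied afterwards.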